In the second Project of my Data Analysis Nanodegree, I am investigating a Movie Database (TMDb) file, which originated from Kaggle. This Dataset has collection of important detials of about 10,000 movies, including their budget, genre, popularity and more.
Information about all given Datasets available you can find on the homepage-links present here.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import pyplot
import plotly.express as px
import squarify # pip install squarify (algorithm for treemap)
from gapminder import gapminder # pip install gapminder
import plotly.graph_objects as go
import seaborn as sns
import os
%matplotlib inline
# Upgrade pandas to use dataframe.explode() function.
#!pip install --upgrade pandas==0.25.0
#file_csv = os.path.abspath('tmdb-movies.csv')
data = pd.read_csv(r"C:\Users\kater\Desktop\tmdb-movies.csv")
#data = pd.read_csv(file_csv)
print(data.columns)
data.head(3)
Index(['id', 'imdb_id', 'popularity', 'budget', 'revenue', 'original_title',
'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview',
'runtime', 'genres', 'production_companies', 'release_date',
'vote_count', 'vote_average', 'release_year', 'budget_adj',
'revenue_adj'],
dtype='object')
| id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | ... | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | Twenty-two years after the events of Jurassic ... | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 6/9/15 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 |
| 1 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | http://www.madmaxmovie.com/ | George Miller | What a Lovely Day. | ... | An apocalyptic story set in the furthest reach... | 120 | Action|Adventure|Science Fiction|Thriller | Village Roadshow Pictures|Kennedy Miller Produ... | 5/13/15 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08 |
| 2 | 262500 | tt2908446 | 13.112507 | 110000000 | 295238201 | Insurgent | Shailene Woodley|Theo James|Kate Winslet|Ansel... | http://www.thedivergentseries.movie/#insurgent | Robert Schwentke | One Choice Can Destroy You | ... | Beatrice Prior must confront her inner demons ... | 119 | Adventure|Science Fiction|Thriller | Summit Entertainment|Mandeville Films|Red Wago... | 3/18/15 | 2480 | 6.3 | 2015 | 1.012000e+08 | 2.716190e+08 |
3 rows × 21 columns
# info of dataset
data.info()
print('Dataframe contains {} rows and {} columns. '.format(data.shape[0],data.shape[1]))
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10866 entries, 0 to 10865 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 10866 non-null int64 1 imdb_id 10856 non-null object 2 popularity 10866 non-null float64 3 budget 10866 non-null int64 4 revenue 10866 non-null int64 5 original_title 10866 non-null object 6 cast 10790 non-null object 7 homepage 2936 non-null object 8 director 10822 non-null object 9 tagline 8042 non-null object 10 keywords 9373 non-null object 11 overview 10862 non-null object 12 runtime 10866 non-null int64 13 genres 10843 non-null object 14 production_companies 9836 non-null object 15 release_date 10866 non-null object 16 vote_count 10866 non-null int64 17 vote_average 10866 non-null float64 18 release_year 10866 non-null int64 19 budget_adj 10866 non-null float64 20 revenue_adj 10866 non-null float64 dtypes: float64(4), int64(6), object(11) memory usage: 1.7+ MB Dataframe contains 10866 rows and 21 columns.
Before answer the questions we need to prepare and clean our dataset.
First, lets drop columns. We will only keep the columns we need and remove the rest of them.
Columns to delete - id, imdb_id, budget_adj, revenue_adj, homepage, overview, production_companies, vote_count and vote_average.
# list of columns that needs to be deleted
del_col = [ 'id', 'imdb_id', 'budget_adj', 'revenue_adj', 'homepage', 'overview' ,
'vote_count','cast','homepage','budget_adj', 'revenue_adj','tagline']
#deleting the columns from the database
data = data.drop(del_col, 1)
data.head(5)
C:\Users\kater\AppData\Local\Temp/ipykernel_10756/1607617444.py:7: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only data = data.drop(del_col, 1)
| popularity | budget | revenue | original_title | director | keywords | runtime | genres | production_companies | release_date | vote_average | release_year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Colin Trevorrow | monster|dna|tyrannosaurus rex|velociraptor|island | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 6/9/15 | 6.5 | 2015 |
| 1 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | George Miller | future|chase|post-apocalyptic|dystopia|australia | 120 | Action|Adventure|Science Fiction|Thriller | Village Roadshow Pictures|Kennedy Miller Produ... | 5/13/15 | 7.1 | 2015 |
| 2 | 13.112507 | 110000000 | 295238201 | Insurgent | Robert Schwentke | based on novel|revolution|dystopia|sequel|dyst... | 119 | Adventure|Science Fiction|Thriller | Summit Entertainment|Mandeville Films|Red Wago... | 3/18/15 | 6.3 | 2015 |
| 3 | 11.173104 | 200000000 | 2068178225 | Star Wars: The Force Awakens | J.J. Abrams | android|spaceship|jedi|space opera|3d | 136 | Action|Adventure|Science Fiction|Fantasy | Lucasfilm|Truenorth Productions|Bad Robot | 12/15/15 | 7.5 | 2015 |
| 4 | 9.335014 | 190000000 | 1506249360 | Furious 7 | James Wan | car race|speed|revenge|suspense|car | 137 | Action|Crime|Thriller | Universal Pictures|Original Film|Media Rights ... | 4/1/15 | 7.3 | 2015 |
The difference between the row database and cleaned obvious and more convinient for analysis.
Now lets clean any duplicated rows.
#drop duplicate rows, but keep first one
data.drop_duplicates(keep = 'first', inplace = True)
rows, col = data.shape
print('We now have {} total entries of movies and {} columns.'.format(rows-1, col))
We now have 10864 total entries of movies and 12 columns.
Which movies have a value of '0' in their budget, revenue and vote average? Let's delete these movies with null values from database
# check if the dataset on null values
# count the total number of rows in each column, which contain null values
data.isna().sum()
popularity 0 budget 0 revenue 0 original_title 0 director 44 keywords 1493 runtime 0 genres 23 production_companies 1030 release_date 0 vote_average 0 release_year 0 dtype: int64
# list of column names that needs to be cleared
clean_row = ['keywords', 'production_companies']
# replace the value of '0' to NaN
data[clean_row] = data[clean_row].replace(0, np.NaN)
#now we will drop any row which has NaN values from clean_row
data.dropna(subset = clean_row, inplace = True)
rows, col = data.shape
print('Now we have only {} entries of movies and {} columns.'.format(rows-1, col))
Now we have only 8704 entries of movies and 12 columns.
and now, after cleaning left only 3853 in 12 columns.
# save the cleaned data
data.to_csv('new_tmdb_movies.csv', index=False)
#Line plot
release_each_year = data.groupby('release_year')['revenue'].sum()
#the stylesheet
sns.set_style("darkgrid", {"grid.color": ".6", "grid.linestyle": ":"})
#giving the size
plt.figure(figsize=(15,5), dpi = 80)
#labeling axis
plt.xlabel('Release Year', fontsize = 12)
plt.ylabel('Total Profits [Mrd. Dollar]', fontsize = 12)
#title
plt.title('Revenue earned by all films in the year they were released')
#plot
plt.plot(release_each_year)
#showing the plot
plt.show()
# year of movies with most revenue
release_each_year.idxmax()
2015
# year of movies with least revenue
release_each_year.idxmin()
1966
# DataFrame to get a better visual output for the last years
release_each_year = pd.DataFrame(release_each_year)
release_each_year.tail()
| revenue | |
|---|---|
| release_year | |
| 2011 | 23327721158 |
| 2012 | 24383279750 |
| 2013 | 24031827285 |
| 2014 | 24024972183 |
| 2015 | 26633137063 |
2015 was the year where movies made the highest revenue.
More then 26 billion dollars was released in this year.
For compare just 847 Millons was released in 1966 year.
Now we want to find similar characteristics of most profitable movies.
# First of all, let us create new column 'profit'
# As we know profit = revenue - budget
data['profit'] = data['revenue'] - data['budget']
# Creating new function discover_min_max(x) to find min and max profit movies.
def discover_min_max(x):
# 'idxmax' and 'idxmin' used to return index of first occurrence of maximum/minimum over requested axis.
high_in = data[x].idxmax()
low_in = data[x].idxmin()
#Indicating the index
high = pd.DataFrame(data.loc[high_in,:])
low = pd.DataFrame(data.loc[low_in,:])
#movie with high and low profit
print("Movie with Highest profit is" ,data['original_title'][high_in],".")
print("Movie with Lowest profit is",data['original_title'][low_in],".")
return pd.concat([high,low],axis = 1)
discover_min_max('profit')
Movie with Highest profit is Avatar . Movie with Lowest profit is The Warrior's Way .
| 1386 | 2244 | |
|---|---|---|
| popularity | 9.432768 | 0.25054 |
| budget | 237000000 | 425000000 |
| revenue | 2781505847 | 11087569 |
| original_title | Avatar | The Warrior's Way |
| director | James Cameron | Sngmoo Lee |
| keywords | culture clash|future|space war|space colony|so... | assassin|small town|revenge|deception|super speed |
| runtime | 162 | 100 |
| genres | Action|Adventure|Fantasy|Science Fiction | Adventure|Fantasy|Action|Western|Thriller |
| production_companies | Ingenious Film Partners|Twentieth Century Fox ... | Boram Entertainment Inc. |
| release_date | 12/10/09 | 12/2/10 |
| vote_average | 7.1 | 6.4 |
| release_year | 2009 | 2010 |
| profit | 2544505847 | -413912431 |
# Read the cleaned dataset
df_clean = pd.read_csv('new_tmdb_movies.csv')
# Get a list of the top 10,50,100 movies for further visualisatons
df_top10 = df_clean.nlargest(10,'vote_average')
df_top50 = df_clean.nlargest(50,'vote_average')
df_top100 = df_clean.nlargest(100,'vote_average')
df_worst10 = df_clean.nsmallest(10,'vote_average')
df1 = df_top100.reindex(columns=['vote_average','original_title','release_year'])
df1.head()
# list of most voted average movies
| vote_average | original_title | release_year | |
|---|---|---|---|
| 5261 | 8.7 | Pink Floyd: Pulse | 2006 |
| 2826 | 8.5 | The Art of Flight | 2011 |
| 6520 | 8.5 | Queen - Rock Montreal | 1981 |
| 437 | 8.4 | The Jinx: The Life and Deaths of Robert Durst | 2015 |
| 3164 | 8.4 | The Shawshank Redemption | 1994 |
# list of least voted average movies
df2 = df_worst10.reindex(columns=['vote_average','original_title','release_year'])
df2.tail(5)
| vote_average | original_title | release_year | |
|---|---|---|---|
| 3678 | 2.2 | Foodfight! | 2012 |
| 3704 | 2.2 | The Sleeper | 2012 |
| 4561 | 2.2 | Atlantic Rim | 2013 |
| 5509 | 2.3 | Dracula 3000 | 2004 |
| 819 | 2.4 | Age of Tomorrow | 2014 |
Above we see information about top 10 movies and worst 10 movies,
sorted exclusively according to average vote.
Let's visualize it with help of bubble chart.
# Bubble chart plot of 10 highest rate moves
fig = px.scatter(df_top10, x="original_title", y="vote_average",
color="vote_average",
size='vote_average', title= "Top 10 Highest Rate Movies",
template = 'plotly')
# Update/Change Layout
fig.update_layout(
title_font_size=18,
title_font_family='Arial',
font=dict(color="DarkBlue"))
fig.update_layout(margin=dict(t=40, b=0, l=0, r=0))
sns.set(rc={'figure.figsize':(5,5)})
fig.show()
# Bubble chart plot of 10 worst rate moves
fig = px.scatter(df_worst10, x="original_title", y="vote_average",
color="vote_average",
size='vote_average', title= "Top 10 Worst Rate Movies",
template = 'plotly')
# Update/Change Layout
fig.update_layout(
title_font_size=18,
title_font_family='Arial',
font=dict(color="DarkBlue"))
fig.update_layout(margin=dict(t=40, b=0, l=0, r=0))
sns.set(rc={'figure.figsize':(5,5)})
fig.show()
Most average voted Movies: The Schawshank Redemption, Stop Making Sence, The Goodfather.
Least average voted Movies: Foodfight, Dracula 3D, FearDotCom
year_data = data.release_year.value_counts(ascending=False).reset_index()
print(year_data.transpose().head())
0 1 2 3 4 5 6 7 8 9 ... \
index 2014 2013 2015 2009 2012 2011 2008 2007 2010 2006 ...
release_year 525 458 444 414 390 386 382 351 343 327 ...
46 47 48 49 50 51 52 53 54 55
index 1970 1967 1972 1968 1963 1965 1960 1962 1961 1969
release_year 39 36 36 36 33 33 30 30 28 25
[2 rows x 56 columns]
Great! Now we have list with information - how many movies were made yearly.
And we can easy find in which year were made more and least movies.
# Most movies
print(year_data['index'].max())
print(year_data['release_year'].max())
print("The most Movies (", year_data['release_year'].max(), ") was released in year" +"",year_data['index'].max(),".")
2015 525 The most Movies ( 525 ) was released in year 2015 .
# Least movies
print(year_data['index'].min())
print(year_data['release_year'].min())
print("Least Movies (", year_data['release_year'].min(), ") was released in year" +"",year_data['index'].min(),".")
1960 25 Least Movies ( 25 ) was released in year 1960 .
#let's check the values obtained
#How many movies were made in the year 2010?
len(data[data.release_year==2010])
343
199 Movies were released in year 2011
and just 4 in 1969 year.
The genre column is made up from a string of genres separated by pipes"|".
We need to divide the movies into groups based on genres to answer the question.
Otherwise, we will have to analyze 1792 combination of genres.
# number of unique genre combinations
data.genres.nunique()
1792
Instead, we will create special function for splitting genre
# let's do it with help of function genre_split
def genre_split(x):
#concatenate all the rows of the genrs
dt_plot = data[x].str.cat(sep = '|')
dt = pd.Series(dt_plot.split('|'))
genre = dt.value_counts()
return genre
# the function to count the movies of each genre
total_gen = genre_split('genres')
total_gen.plot(kind= 'bar',figsize = (15,4),
fontsize=12,
colormap='gist_gray')
# title and the labels
plt.title("Genre With Highest Release",fontsize=12)
plt.xlabel('Genres',fontsize=12)
plt.ylabel("Number of Movies",fontsize= 12)
sns.set_style("whitegrid")
Drama has the most releas of movies(about 3900).
Second and third place are taken by Comedy and Thriller genre.
# Total Films shot
totaly_movie = total_gen.sum()
totaly_movie
22170
# Of them Drama, Comedy and Thriller
totaly_drama = total_gen['Drama'].sum()
totaly_comedy = total_gen['Comedy'].sum()
totaly_thriller = total_gen['Thriller'].sum()
totaly_drama
3939
# Percent of Drama, Comedy and Thriller
drama_percent = totaly_drama*100/totaly_movie
comedy_percent = totaly_comedy*100/totaly_movie
thriller_percent = totaly_thriller*100/totaly_movie
print("Percent of drama genre: "+"%.2f" % drama_percent)
print("Percent of comedy genre: "+"%.2f" % comedy_percent)
print("Percent of thriller genre: "+"%.2f" % thriller_percent)
Percent of drama genre: 17.77 Percent of comedy genre: 13.79 Percent of thriller genre: 11.11
We have identified which genres more and least popular.
Absolute leader is Drama (17,8%), next Comedy (13,8%) and Thriller(11,1%).
Below is another visualisation art - pie chart diagramm,
which gives an idea of releationship between year and genres for Top 50 Movies.
# Pie chart diagramm
fig1 = px.sunburst(df_top50, path=['release_year','genres'], #root,branches,leaves
maxdepth = -2,
title = "Pie chart year/genres overview of TOP 50 movies",
template = 'seaborn')
fig1.update_layout(
title_font_size=20,
title_font_family='Arial',
font=dict( family ="sans-serif",
size=12,
color="DarkBlue")
)
fig1.update_layout(margin=dict(t=30, b=0, l=0, r=0))
fig1.update_traces(textinfo='label + percent parent') #percent entry - values adopt to entry
fig1.show()
At the pie chart we can see multiple-variable exploration of release year, genre of the movies.
We can see that the greatest number from top 50 were produced in year 2014, 1999, 1994 and in 2015.
What is actually surprising for me, that so many top voted movies were released before 2000.
Totally majority of the top movies has Drama genres.
Important to say, this exploration is only for top 50 movies and not for complete dataset, therefore chart indicates incomplete picture about what genres are most often filmed.
#grouping of data by years of release from 1960 to 2015 with time period - 10 years.
data.groupby('release_year').mean()['runtime'].plot(xticks = np.arange(1960,2015,10))
#the figure size.
sns.set(rc ={'figure.figsize':(15,5)})
#setup the title of the figure
plt.title("Runtime of Movies",fontsize = 15)
#labeling
plt.xlabel('Year',fontsize = 13)
plt.ylabel('Average Runtime',fontsize = 13)
#the stylesheet
sns.set_style("darkgrid", {"grid.color": ".6", "grid.linestyle": ":"})
data["runtime"].mean()
103.61344055140724
From this plot we can see that:
This Dataset is rich on information. Here is my conclusions:
The online documentations of pandas, numpy, and matplotlib.
In case of errors i used https://stackoverflow.com/
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])
4294967295